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1 .   INTRODUCTION 

This  paper  reports  a  numerical  experiment  with  an  approach  to  inversion 
of  a  real  matrix  using  a  block  partitioning  structure.  The  study  arises  in 
the  context  of  design  of  a  large  scale  mathematical  programming  system  for 
use  within  various  computer  environments.  The  scheme  permits  controlled 
explicit  pagination  of  mathematical  operations  to  coincide  with  boundaries 
specified  by  hardware  memory  management.  The  timing  results  presented  in- 
dicate that  maximizing  work-per-page  does  not  necessarily  minimize  total 
execution  time  as  folklore  would  advise.  Further,  performance  of  an 
inversion  scheme  such  as  this  is  not  always  adequately  estimated  by 
classical  means.  The  implications  even  for  the  simple  cases  reported  here 
are  potentially  of  importance  to  our  further  design  work. 

Obtaining  the  explicit  numerical  inverse  of  a  real  matrix  presents  a 
classic  problem  in  numerical  methods  which  has  received  intense  study  in  the 
literature;  the  wide  spectrum  of  applications  requiring  inversion  of  matrices 
bears  testimony  to  its  fundamental  nature.  Numerous  algorithms  for  matrix 
inversion  have  been  presented  and  analyzed  for  error  and  speed  [8,12,31]  as 
have  methods  for  error  reduction  by  pivot  selection  and  scaling  [1,9], 
iterative  improvement  of  accuracy  [33],  and  exploitation  of  special  features 
in  the  matrix  to  be  inverted,  especially  sparseness  [7,22,23,24,28,30,34,35], 
symmetry  and  band  structure  [26].  Studies  of  the  form  and  complexity  of  the 
inversion  process  have  included  graph  theoretic  descriptions  [4,28],  statis- 
tical characterizations  [19],  exploitation  of  algorithmic  parallelism  [20], 
exercising  special  features  of  multiprogramming  environments  [17],  and  so 
forth. 


One  of  the  active  application  areas  for  numerical  matrix  inversion  has 
been  in  large  scale  mathematical  programming.  Most  successful  codes  for 
large  problems  must  resort  to  some  form  of  inversion  (and  often,  reinversion) 
technique  based  either  on  special  structure  in  the  problem  recognized  a 
priori,  or  on  special  storage  mechanisms  for  the  inverse  matrix.  The  moti- 
vation for  these  efforts  is  that  the  available  main  memory  on  modern  digital 
computers  is  not  sufficiently  large  to  store  the  inverse  explicitly  for 
problems  of  contemporary  scale  (say,  thousands  of  rows). 

Unfortunately,  many  methods  based  on  a  priori  structure  in  problems  are 
inherently  ad  hoc  in  nature  -  decomposition  methods  [2,6,11,25]  are  con- 
sidered by  many  to  exemplify  this  shortcoming.  On  the  other  hand,  for  some 
important  classes  of  problems  sharing  a  common  special  structure,  new  tech- 
niques developed  in  concert  with  new  data  structures  for  problem  represen- 
tation have  led  to  remarkable  breakthroughs.  In  fact,  sometimes  a 
triangulated  basis  is  superior  to  an  explicit  inverse.  As  an  example,  a 
primal  (simplex)  network  code  [3]  has  triangulated  node-arc  incident  matrices 
of  rank  10,000  in  20  seconds  (IBM  360/67)  with  no  rounding  error.  Other 
successful  special  methods  have  included  element  generation  by  generalized 
upper  bounding  [5],  factorization  [15],  and  other  compact  working  inverse 
basis  methods  [14]. 

Special  storage  mechanisms  for  the  inverse  have  typically  included 
columnwise  product  form  representation  (with  "ETA"  columns)  [18]  and  modifi- 
cations attempting  to  maintain  sparcity  in  the  inverse  [13].  Also,  inversion 
and  manipulation  of  sparce  matrices  is  possible  based  on  other  structural 
decompositions,  including  doubly  linked  lists,  bit  maps,  etc.  [22,29]. 

The  introduction  of  computers  with  memory  organized  in  pages  which  are 
transferred  in  bulk  swapping  transactions  between  high  speed  magnetic  devices 


and  main  memory  has  led  many  to  conclude  that  the  vastly  increased  virtual 
memory  capacity  will  allow  design  of  large  scale  algorithms  ignorant  of 
main,  or  cache,  memory  configurations.  Unfortunately,  algorithms  designed 
originally  for  conventional  systems  have  performed  very  poorly  in  the  paged 
environment.  The  study  of  dynamic  program  behavior  within  paging  systems 
indicates  that  the  global  cost  of  page  swaps  is  far  from  negligible.  Thus, 
with  a  given  amount  of  work  to  perform  in  inverting  a  matrix,  design 
criteria  now  include  consideration  of  both  minimizing  page  swaps  and 
maximizing  the  work  performed  on  each  page  during  its  cache  residence  [16, 
17]. 

This  paper  presents  an  approach  to  inverting  a  real  matrix  using  a 
partitioning  structure  and  block  pivots  ("B  pivots")  which  perform  bulk 
inversions  of  interior  submatrices  [21,27].  The  technique  is  demonstrated 
to  have  theoretical  computational  efficiency  in  terms  of  long  (floating 
point)  operations  equal  to  classical  methods,  and  parametric  high  resolution 
timing  of  a  FORTRAN  routine  executed  on  a  dedicated  processor  with  blocks  of 
varied  size  is  given,  with  some  discussion  of  surprising  behavior. 

This  scheme  employs  explicit  pagination  of  the  mathematical  operations 
in  inversion  and  is  potentially  useful  for  large  scale  mathematical  programming. 
The  reason  that  the  empirical  study  concentrates  on  a  dedicated  processor  is 
two-fold.  For  contemporary  large  scale  programming  applications,  the  wall 
clock  time,  rather  than  central  processor  active  compute  time,  often  dominates 
the  attention  of  the  user  -  these  applications  are  capable  of  laying  seige  to 
the  entire  computer  system,  whether  paged  or  not.  Second,  we  are  interested 
in  the  performance  characteristics  of  various  numerical  techniques  imbedded 
in  a  large  scale  mathematical  programming  code  now  under  development.  The 
design  of  the  new  system,  based  on  parametric  studies  such  as  this,  includes 

3 


the  provision  for  robust  performance  within  varied  computer  environments. 

Similarly,  the  examples  are  chosen  to  be  full  rank  and  dense  in  the 
interests  of  emphasizing  the  execution  performance  of  the  partitioned 
algorithm,  rather  than  that  of  exercising  some  supporting  storage  structure, 
list  mechanism,  or  other  feature  not  under  study  here.  Even  for  large, 
sparce  problems,  however,  nonzero  elements  can  often  be  aggregated  to  dense 
blocks  over  subsets  of  the  matrix  rows  and  columns. 

The  advent  of  minicomputers,  micro  computers,  and  various  multiple 
processor  configurations  portends  further  application  of  B-pivot  inversion 
even  for  relatively  small  or  intermediate  scale  problems. 


2.  INVERSION  TRANSFORMATION 

The  classical  inversion  method  used  here  is  in-situ  Gauss-Jordan 
pivoting,  chosen  for  speed,  generality,  and  storage  economy.  To  illustrate, 
let  A  be  the  nonsingular  real  square  matrix  of  rank  n  to  be  inverted, 
with  a.,  a  general  element  of  A.  An  in-situ  scaler  pivot  on  a.   is 
defined  as  follows  for  a.  f     0  with  "«-"  indicating  elemental  replace- 
ment: 

ak£  *  ]  /  ak£  (Pivot  element)  (1) 

akj  "*"  akj  f     \ll   Jsl»--»ni  J?1     (pivot  row) 
aU  *"  ~aU  /  ak£;  i=1"-'n'  ^k  (pivot  column) 

aij  *  aij  "  ai£akj  7  akj^;  i=l,..,n;  i7k  (general 

j=l,..,n;  j>£.  elements) 

An  inverse  is  developed  in  place  by  performing  n  sequential  pivotal 
transformations  of  this  type  on  the  main  diagonal  elements  a  ,  p=l,..,n. 
However,  since  the  accuracy  of  the  computations  can  be  adversely  affected  by 
the  relative  magnitude  of  the  elements  in  A  ,  and  since  the  transformation 
is  not  defined  for  a.  =  0  ,  a  pivot  selection  strategy  is  usually  employed 
in  inversion. 

Notably,  "partial"  pivots  use  sequential  rows  for  k  and  in  each  select 
l    by  finding  the  largest  absolute  element  in  a  previously  unused  column. 
"Full"  pivots  select  the  largest  absolute  matrix  element  remaining  in  an 
unused  row  and  column.  Often  equilibration,  or  scaling,  of  the  matrix  columns 
for  partial  pivoting,  and  of  rows  and  columns  for  full  pivoting  can  further 
increase  accuracy. 


Both  partial  and  full  pivot  strategies  require  that  the  inverse  be 
recovered  by  recording  the  coordinates  for  each  pivot  and  using  this  history 
to  permute  rows  and  columns  of  the  result  matrix  left  by  pivoting.  If 
r  =  k  and  c  =  £  for  pivot  p,  p=l,..,n  ,  then  inverse  row  c.  is 
located  in  matrix  row  r.  ,  and  inverse  column  r.  is  located  in  matrix 
column  c..  Thus,  the  increased  accuracy  of  these  pivot  selection  schemes 
comes  at  the  cost  of  extra  scanning  of  elements  in  pivot  selection  tourna- 
ments, bookkeeping  and  in  reordering  the  resulting  inverse. 

A  simple  computation  timing  estimate  for  Gauss-Jordan  pivoting  can  be 
arrived  at  by  concentrating  only  on  the  floating  point  arithmetic  required 
by  (1).L^3J  Assuming  an  efficient  program  utilizing  partial  results  in  the 
pivot  rows,  or  columns,  in  computing  the  general  elements,  a  total  of  n 

pivots  will  be  performed,  each  requiring  1  +  2(n  -  1)   multiplications  and 

o 

(n  -  1)   additions.  Aggregating  operations  for  each  pivot  gives  the  total 

W(  n  )  =  2n2  -  2n  +  1  ,  (2) 

and  a  complete  inversion  time  of  n  W(  n  ).  For  simplicity  the  effect  of 
pivot  selection  strategy  and  other  program  overhead  has  been  ignored. 

Now  let  us  introduce  B-pivots  to  the  inversion  by  partitioning  A  into 
submatrices  as  follows: 


A  = 


-1L±AL2- 

A  "T A~ 

M21  irt22 


(3) 


If    A-|i     is     b  x  b   ,  an  in-situ  B-pivot  on    A,,     is  defined  as  follows 
for  nonsingular    A,-, : 


All     "*"    All 


-1 


(B-pivot  block) 


(4) 


Ai2  *"   An     Ai2 


(B-pivot  row) 


A21     *"  'A21  All 


•1 


(B-pivot  column) 


-1 


A22    *"    A22     "  A21  All       A12     general   elements) 

Since    A-,-,"1     is  found  by  using  the  Gauss-Jordan  transformations  in  the 

first    b    rows  and  columns,  it  is  clear  that  if  we  continue  with  another 

B-pivot  in  the  next     n  -  b  rows  and  columns  of    A22»     A        will   have 

been  formed.     The  resulting  inverse  in  terms  of  the  original   elements  in 
(3)   is: 


A 


-1 


An     +  An     Ai2 B  A2i  An 


-B  A21  An 


-1 


^n"1  Ai2 B 


with 


22  "     21      11  12  ' 


and  may  be  verified  by  multiplication. 

This  B-pivot  process  may  be  continued  analytically  to  any  number    P     of 
B-pivots  of  varying  size     b   ,  p=l,..,P,  indeed,  with     b,=...=b     =  1,  and 

r  r 

P  =  n,  or  with  b-,  =  n,  and  P  =  1,  the  process  is  simply  the  Gauss-Jordan 
transformation. 

A  computation  timing  estimate  using  the  simple  approach  producing  (2) 
for  a  B-pivot  of  size  b  will  give 

b  W(b)  =  2b3  -  2b2  +  b 
operations  for  the  inversion  of  the  pivot  block  in  (4),  for  the  B-pivot  row 
b  (n  -  b)  (2b  -  1)  operations,  an  equal  number  in  the  B-pivot  column,  and 


2  2 

(n  -  b)  (2b  -  1)  +  (n  -  b)   floating  point  computations  for  the  general 

elements.  This  again  sssumes  an  efficient  program  which  uses  partial  results 
in  the  B-pivot  row,  or  column,  to  compute  general  elements.  For  the  complete 
B-pivot  we  have  the  total 

b  (2n2  -  2n  +  1)  =  bW(  n  )  .  (5) 

Thus,  the  B-pivot  approach  nominally  requires  no  more  computational  effort, 
stated  in  terms  of  floating  point  operations,  than  scaler  pivots. 

3.  APPLICATION  SCHEME 

Let  b,  =  b2=...=b  ,  ,  b  =  n  -  J,  b  .  This  "fixed  block"  strategy 

r   ' 

requires  nonsingular  A,-.,..., A   .  Applying  this  method  to  test  matrices  of 
dimensionality  n  =  10(10)50  with  b  =  1(1 )n  in  a  FORTRAN  test  program 
produced  the  timing  results  shown  in  Figures  1  and  2  when  run  on  a  dedicated 
IBM  360/67.  The  times  shown  are  for  active  computation  with  memory  resident 
matrices  and  are  precise  to  four  significant  digits.  As  indicated,  both 
partial  and  full  pivot  selection  strategies  within  pivot  blocks  were  tested. 

The  FORTRAN  program  specifies  a  matrix  of  appropriate  dimensionality, 
the  block  size  to  be  used,  and  then  exercises  subroutines  to  perform  the 
block  inversions,  row  block  transformations,  column  block  transformations, 
and  general  block  transformations.  The  program  is  compiled  and  optimized 
for  run  time  efficiency  by  the  IBM  FORTRAN  (H)  compiler. 

The  timing  prediction  of  (5)  appears  to  hold  at  least  as  a  polynomial 
form  for  the  full  block  (b  =  n)  execution  times  in  Figure  1.  However, 
note  the  surprising  gain  in  speed  for  intermediate  sizes  of  b  in  Figure  2. 
This  is  partially  due  to  the  work  avoided  in  restricting  to  the  pivot  blocks 
the  range  of  scalar  pivot  selection  and  subsequent  matrix  permutations. 
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Conversely,  for  very  small  B-pivots  and  those  requiring  a  last  B-pivot  much 
smaller  than  b,  an  "odd  B-pivot",  the  housekeeping  overhead  of  the  program 
increases  the  execution  time  significantly.  Thus,  it  may  be  advantageous  to 
choose  block  sizes  which  partition  the  problem  into  blocks  of  homogeneous 
size,  rather  than  to  fill  a  page. 

For  these  examples,  the  classical  assumption  that  long  floating  point 
operations  required  by  an  algorithm  dominate  performance  on  a  computer  appears 
to  be  contradicted.  It  is  certainly  true  that  the  time  required  for  floating 
point  operations  relative  to  others  such  as  register  loads,  compares,  and  so 
forth,  has  decreased  greatly  from  the  days  of  software  arithmetic  subroutines 
to  contemporary  floating  point  hardware  (such  as  on  the  IBM  360/67  used  here). 

As  a  practical  matter,  resident  memory  for  three  blocks  is  required  for 
each  complete  B-pivot  as  shown  in  (4).  The  program  can  easily  be  modified 
to  emulate  hardware  delays  caused  by  page  swaps,  or  other  interference  by  a 
paging  mechanism.  The  times  reported  do  not  reflect  such  modifications. 

4.  DISCUSSION 

For  matrix  inversion  and  other  matrix  arithmetic  the  "fixed  block" 
approach  allows  b  to  be  chosen  to  fit  within  a  page,  or  cache  area,  and 
minimize  boundary  violations  by  the  computation.  However,  the  folklore 
which  specifies  that  each  page  should  be  as  nearly  filled  as  possible  is  not 
necessarily  true.  The  problem  dimension  (or  other  consideration)  can  in 
some  cases  dictate  a  less  than  full  page  scheme  for  fastest  (or  best) 
execution. 

It  is  important  to  note  that  the  block  approach  generates,  a  priori, 
demands  for  the  matrix  blocks  from  the  out-of -memory  device.  Thus,  an 


agenda  of  input/output  operations  can  be  used  to  achieve  simultaneity  with 
computation.  This  can  be  much  faster  than  a  memory  page  fault  and  interrupt 
mechanism.  Even  in  traditional  single  user  environments  the  block  size  may 
be  chosen  to  balance  access-transfer  time  for  slow  magnetic  devices  with 
computation  time,  minimizing  retrieval  overhead  with  little  memory  cost. 
Note  from  Figures  1  and  2  that  even  a  relatively  small  block  pivot  requires 
sufficient  computation  time  (e.g.,  more  than  one  second  for  n  =  40)  to 
permit  simultaneous  access  to  a  disc  or  drum  device. 

The  fixed  blocks  provide  a  convenient  parcelling  of  matrix  manipulation 
effort  as  well  as  storage,  and  permit  for  many  applications  a  more  convenient 
access  structure  than  traditional  column  by  column  methods.  This  is  useful 
for  large  mathematical  programming  problems  with  block  angular,  staircase, 
or  other  special  block  structure,  [14].  These  B-pivots  on  fixed  blocks  may 
also  be  used  to  equitably  distribute  computation  among  available  parallel 
processors  in  sizeable  bulk  to  permit  relatively  lengthy  independent 
operation. 

5.  CONCLUSION 

The  usefulness  and  speed  of  B-pivot  schemes  come  with  one  important 
disadvantage:  the  pivot  blocks  must  be  nonsingular  mathematically  and  in 
the  stricter  numerical  sense.  Since  scalar  pivot  selection  is  restricted 
to  each  pivot  block,  one  should  try  to  assemble  the  matrix  A  accordingly. 
For  instance,  in  large  linear  programming  problems  a  history  of  algorithm 

progress  can  be  used  to  provide  A, , A   .  In  fact,  B-pivots  may  be 

considered  in  this  context  as  an  extension  of  the  concept  of  product  form 
inverse.  The  B-pivot  method  works  well  for  such  problems,  providing  a 
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convenient  tool  for  fast  inversion,  or  reinversion,  of  kernels  embedded  in 
technological  coefficients. 

Other  classes  of  matrices  exhibit  a  natural  diagonal  dominance  permitting 
effective  B-pivot  inversion.  For  instance,  the  noise  covariance  matrix  has 
been  proposed  for  such  use  in  [21],  which  includes  an  extensive  group 
theoretic  characterization  of  admissability.  Also,  algorithms  have  been 
proposed  for  rearranging  matrices  to  block  angular  form  [32]. 

The  accuracy  of  B-pivots  may  be  improved  by  computing  inner  products 
with  extended  precision  and  special  ordered  addition.  In  some  special 
cases  of  A,  such  as  zero-one  or  sparse  conditions  as 


A  = 


A    '   0 
All  ,  - 


A    '  A 
rt21  |  22 
' . 


significant  effort  can  be  avoided  by  the  B-pivot  approach  coupled  with  appro- 
priate modifications  of  program  logic.  Several  approaches  to  avoiding  B-pivot 
singularity  failures  have  been  suggested,  including  "dynamic"  blocks,  "gang" 
blocks,  pivot  block  selection  procedure,  and  a  matrix  construction  to  be 
performed  concurrently  with  a  scaling  algorithm  as  in  [9].  Fortunately, 
none  has  been  necessary  in  applications  to  date. 

Continuing  research  focuses  on  block  triangulation  and  dynamic  factor- 
ization schemes  for  large  scale  mathematical  programming,  combined  with 
generalized  upper  bounding  in  a  hybrid  system  with  both  explicit  and  logically 
generated  elements.  It  is  becoming  increasingly  clear  from  experiments  such 
as  this  that  the  (logical)  algorithm  and  data  representation  of  the  program, 
and  the  (physical)  organization  of  computations  performed  on  the  host  computer 
under  operating  system  control  interact  in  subtle  ways  to  give  aggregate  per- 
formance which  is  often  counter-intuitive  and  seldom  improved  by  ignoring  the 
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details  of  either.  The  effect  can  be  so  pronounced  that  we  are  reexamining 
several  classical  techniques  in  this  new  light. 
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